Semi-automatic selection of summary statistics for 

ABC model choice 



Dennis Prangle*| Paul Fearnhead| Murray P Cox^§ Patrick J Biggs^^ 

and Nigel P French^^ 

February 25, 2013 
Abstract 

A central statistical goal is to choose between alternative explanatory models of 
data. In many modern applications, such as population genetics, it is not possible to 
apply standard methods based on evaluating the likelihood functions of the models, 
as these are numerically intractable. Approximate Bayesian computation (ABC) is 
a commonly used alternative for such situations. ABC simulates data x for many 
parameter values under each model, which is compared to the observed data Xohs- 
More weight is placed on models under which S{x) is close to S'(xobs), where S maps 
data to a vector of summary statistics. Previous work has shown the choice of S is 
crucial to the efficiency and accuracy of ABC. This paper provides a method to select 
good summary statistics for model choice. It uses a preliminary step, simulating many 
X values from all models and fitting regressions to this with the model as response. 
The resulting model weight estimators are used as S in an ABC analysis. Theoretical 
results are given to justify this as approximating low dimensional sufficient statistics. 
A substantive application is presented: choosing between competing coalescent models 
of demographic growth for Campylobacter jejuni in New Zealand using multi-locus 
sequence typing data. 
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1 Introduction 



The increasing availability of modern genetic data offers the possibility of learning 
more than ever before about the processes which generated it, for example the details 
of demographic change. However, for stochastic models that incorporate a high level 
of detail, it is impractically costly to evaluate numerically the probability of a dataset, 
preventing inference by standard likelihood-based methods. This has motivated the 
development of likelihood-free approaches, such as approximate Bayesian computation 
(ABC), which utilise the fact that simulating data from these models is relatively 
computationally cheap. 

There is particular interest in u sing these me t hods to choose between explanatory 
models for observed data. However Robert et al. 



( I2OIII ) illustrated that applying ABC 
to model choice problems can produce highly inaccurate results. This paper provides 
methods to address these concerns and improve the informativeness and efficiency of 
ABC model choice. We focus on a particular application, inferring the demographic 
history of Campylobacter jejuni in New Zealand from population genetic data. This 
will be described in detail later. 

A simple ABC algorithm operates by simulating data sets x under various model 
and parameter pairs {A4.,9). Pairs are accepted when x is sufficiently close to the 
observed data Xobs- This produces a sample of independent draws from an approxi- 
mation to the Bayesian posterior distribution i.e. that of A^, 9\x. Closeness is judged 
by the dis tance betwe e n vectors of summary statist ics ^(xobs) and S{x). Previous 



work (e.g. iBlumI I2OIOI : iFearnhead and Prangld l2012l ) has shown that the quality of 
the approximations produced by ABC algorithms decays rapidly with the dimension 
of 5*. This motivates finding low dimensional summary statistics. However, it is cru- 
cial that the se are also in f ormat ive, as otherwise the problem of inaccurate results 
described by 



Robert et al 



(l201lf l can occur. 

This paper sets out a method to choose S{x) for use in model selection. We give a 
theoretical result showing the existence of a low dimensional vector of statistics suffi- 
cient for model choice (under an appropriate definition given later) . Our method aims 
to estimate such a vector. The idea is to use an extra simulation step to produce many 
(A^, 9, x) triples and then fit simple regression models of M. on x. Predictors from the 
fitted regressions form estimates of low dimensional sufficient statistics, and are used 
as 5* in a main ABC analysis. We refer to the a pproach as the semi-autornatic method 
as it adapts the method of the same name in IFearnhead and Prangld (120121 ) which 
chooses 5* by regressing 6' on x when the aim is inference of continuous parameters. 
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We expect that the targeted sufficient statistics are often comphcated functions 
of the data which are hard to estimate globally. To make the task easier, we advise 
that the regressions are based on data simulated, within each model, from a limited 
subset of parameter values which is judged by preliminary analysis to hold most of that 
model's posterior mass. In other words, the simulation step mentioned above performs 
simulations from the models of interest following a truncation of their parameter 
supports. The resulting S can only be expected to perform well for choice between 
these truncated models. A separate theoretical contribution of the paper is to relate 
results from such a choice to the original model choice problem. 

Our approach o f performing regressions based on simulated data is similar to 



Estoup et al.l ( 120121 ) who instead use linear discriminant analysis. We expect our other 



contributions would also be useful to this approach. Other work on ABC summary 
statistics has focused on validating a particular choice of S. One approach is to 
run ABC analyses on a 
provides accurate results (ISousa et al 



arge number of s i mulated data sets t o check whe t her S 



2012 



Siodin et al.l . l2012h . 



Marin et al 



give a complementary approach, identifying necessary and sufficient properties of S 
for an ABC analysis to be consistent in an asymptotic regime corresponding to highly 
informative data. Essentially, S must have different asymptotic means under the 
models. Given a choice of S, this property can be tested theoretically or through 
simulation. Validation techniques are useful, but not sufficient, to choose 5* for high 
dimensional genetic data where it is infeasible to compare all possible choices of S. 
Our contribution is a method which can be applied in this setting to propose good 
choices of S. 

Ideally the same ABC simulations would be used to provide inference on models 
and also their parameters. The method we present provides summary statistics suit- 
able for model choice only. It would be desirable to augment them with informative 
summaries on model parameters, and we give an approach to do this that is specific 
to our main application. General methods are an interesting topic for future research. 

The remainder of the paper is organised as follows. Section [2] describes ABC 
methods and our notation. Section |3] gives theoretical results on sufficiency, with 
proofs delayed until an appendix. Section HJexplains our semi-automatic ABC method, 
and Section [5] illustrates it for simple examples. The application to Campylobacter 
data is given in Section |6l and the article concludes with a discussion in Section [71 
Further theore t ical a nd simulation results are provided as supplementary material 
( IPrangle et all , boi.sl ). 
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2 Background 



Denote by a random variable which can take values A^i,A^2,---,A^m, represent- 
ing possible models. Let pm be its prior mass function. In an abuse of notation M. 
will also denote a generic value of the variable, with usage clear from the context. 
Each model represents a joint distribution TT{x,d\M.) on the data x and parameters 
6' G 0. This can be written as the product of prior and likelihood terms but we con- 
centrate on the joint form for later convenience and to emphasise that the definition 
of a model includes a parameter prior. Note that it is possible for the parameters 
under each model to belong to different spaces, in which case B is their union, and 
that 9 will also be used to denote both a random variable and generic value. 

Bayesian inference concentrates on Ti{9\x,M.) - the posterior distribution of pa- 
rameters under a specific model - and Pr(A^ \x) - the posterior model probabilities. In- 
ference on models can also be summarised using B ayes factors Bij = n{x\Aii)/Tr{x\J^j)] 
the ratio of the evidences under models A4i and Aij. The Bayes factor does not in- 
volve pm, but incorporating this information allows calculation of the ratio of posterior 
weights: 

FTiM,\x)/FiiMj\x) = B,^pM{Mi)/pM{Mj). 

ABC is used in situations where it is possible to simulate x\M.,0 but evaluation 
of the density 7r(a;|A^,^) is im possible or impractica lly costly. A simple approach to 



ABC inference is Algorithm [T] ( ICrelaud et al.l . 120091 ). 



Input: Observed data Xobs? and a function S{-). 

A threshold h > and a distance function d{-, ■). 
An integer > 0. 

Iterate: For i = 1, . . . , N 

1. Simulate M* from pm{M). 

2. Simulate 6* from TT{e\M*). 

3. Simulate Xsim from 7r(x 16**, A^*). 

4. Accept {M*,9*) if d{S{xoh^), S{x,i^)) < h. 

Output: A set of accepted model and parameter pairs of the form {Ai*, 6*). 



Algorithm 1: Rejection sampling ABC incorporating model choice and parameter 
inference. 
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Letting I represent an indicator function, define 



Pabc{M\S{x))^Pm{M) J n{S{x)\M)I[d{S{xo^,s),S{x)) < h]dx, 
nABciO\M,S{x)) oc n{9) [ n{S{x)\9, M)I[d{S{xohs), Six)] < h)dx. 



Then the sample of (A^, 9) values output by Algorithm [T] is drawn from a distribution 
with conditionals 7rABc(^|-^) 'S'(a;)) and marginal Pabc(-^|'S'(x)). 

In the limit /i — )■ 0, the ABC target distributions just defined converge on Pr(A^ |S'(x)) 
and 'n{9\M.^ S{x)). However, reducing h decreases the output sample size, increas- 
ing Monte Carlo approximation error. A curse of dimensionality result reviewed in 
the supplementary material shows that the rate of increase in error rises with the 
dimension of S. This motivates a low dimensional 5*. It is also important that S is 
informative so that the limiting ABC targets approximate the posterior distributions 
Pr(A^|x) and ti{9\M,x) well. Hence S" is a crucial tuning choice. 

In practice, the results of Algorithm [1] can be highly variable if son ae prior model 
masses are small. Algorithm [2] is a more stable alternative suggested by 



Grelaud et al 



mm . 



As Algorithm [T] except: 
1. Set A^* to A^i, A^2, • • • , A^A/ with equal probability. 



Algorithm 2: A more stable modification of Algorithm [H 



Algorithm 2 samples Ai* values from a uniform distribution rather than pm-, and 
it is necessary to correct the results to take this into account. Let Ui be the number 
of occurrences of Aii in the output sample. Then Ui/uj is an estimator of the Bayes 



factor Bij and niPM{M.i) / Yl!jLif^jPM{-J^j) is an estimator of Vi{J\4.i\S{x)). The 
asymptotic and curs e of dimensionality results outlined above continue to hold. See 



Grelaud et al. 



( 120091 ) and the supplementary material for full details. 



More efficient ABC model choice algorithms have been proposed, mainly based o n 



2010 



Del Moral et al. 



2012). 



sequential Monte Carlo (SMC) (e.g. iToni and Stump! . 
Howev er, the tuning issues just described remain. The SMC algorithm of lToni and Stumpf 
( I2OIOI ) is used later and described in the supplementary material. Another ap- 
proach to improve the quality of ABC results is to post-process them. This uses 
accepted parameters ^*'^, 6**'^, . . ., models A^*'^, A^*'^, . . . and the corresponding simu- 
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lations x*'^, x*'"^ 



2002 



■ For parame ter inference regression adjustment ( iBeaumont et al. 
Blum and Francoisl . 120101 ) fits a model 6 = f{x,e), where / is a determinis- 



tic function and e a random residual, and outputs adjusted values 6^''* = /(xobsjC*). 
Model choice results can be post-proces sed by fittinK a m ultinomial regression model 
Pr(A^|a;) = g{x) and returning ^(a;obs) ( iBeaumontl . l2008[ l . 



3 Theory 

A statistic S{x) of data x is said to be Bayes sufficient for parameter 9 if 9\S(x) and 
9 1 X h ave the same distribution for any prior distribution and almost all x { Kolmogorovi 



19421 ) ■ This is a natural definition of sufficiency for ABC, as it shows that in an ideal 
ABC algorithm with /i — )■ 0, the ABC target distribution equals the correct posterior 
when S is used. Throughout later sections of this paper we use "sufficient" to mean 
Bayes sufficient. Theorem [T] gives an alternative characterisation of Bayes sufficiency 
for M. in the setting described in Section |2l 

Theorem 1 Let T{x) = {Ti{x),T2{x), . . . ,Tm-i{x)} where 

M 

T,{x) = Vi{x\M.;)/^Vi{x\M,)]. 

Then S is Bayes sufficient for M. if and only if there exists a function g such that 
g[S{x)] = T{x) for almost all x. 

Theorem [1] shows that for any situation with M models there are sufficient statis- 
tics for model choice of dimension M — 1, namely the vector T(x). Furthermore, 
vectors S{x) which can be transformed to T{x) are also sufficient. 

Proof See Appendix. 

A sketch of the proof is as follows. The theorem states that Bayes sufficiency of 
S{x) for Ai is equivalent to there being a deterministic transformation from S{x) to 
T{x). The latter vector is M — 1 posterior probabilities given observations x and uni- 
form pm- Under uniform pm, conditioning JH on S{x) satisfying this condition clearly 
recovers the posterior weights. Reweighting can be used to show that the posterior is 
also recovered under any other pm- The converse can be shown by construction. 

One particular sufficient choice of S{x) used later is a vector of all Bayes factors 
under a one-to-one transformation. Additionally, we note that a sufficient S{x) may 
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contain summaries which do not contribute to T{x) but are useful for parameter 
inference. 

Theorem [1] is similar to Theorem 3a of iFearnhead and Prangld (120121 ). which shows 
that for continuous parameters 6, S{x) = E{6\x) is an optimal choice to estimate 
parameter means in terms of minimising quadratic error loss. However this S{x) is 
typically not sufficient for 6. Theorem [1] is a stronger result for the case of model 
choice (or, equivalently, for estimating discrete parameters) showing the existence of 
low dimensional vectors of sufficient statistics. 



4 Method 

The low dimensional sufficient statistics described by Theorem [1] are generally not 
available. However their existence motivates an approach of approximating them 
from simulated data, and then using these approximations as S{x) within ABC, as 
outhned in Algorithm |3l Step 2 requires some user input, as will be described in 
Section H?T| so the method is referred to as "semi-automatic ABC". 

1. Simulate a large number of {J^,6,x) triples. 

2. Calculate S{x) by estimating sufficient statistics from simulations. 

3. Perform the ABC analysis using S{x). 



Algorithm 3: Outline of simple semi-automatic ABC for model choice. Full details of 
the steps are given in Sections 14.11 and 14.21 

Sufficient statistics are likely to be highly complicated functions of the data due 
to the complexity of the models, and thus hard to approximate. To make the task 
more tractable, we recommend some optional extra steps to give Algorithm |H This 
simplifies the models by concentrating on the most likely parameter values. We view 
this as replacing the models 7r(6',x|A^j) with truncated models 

7r{e, x\M'i) oc TT{e, x\Mi)i{e g Ri), (i) 

where Ri is a training region for model Aii. Calculation of 5* is performed using data 
simulated from the truncated models. The resulting S estimates sufficient statistics 
for the choice between the truncated rather than original models. Therefore the main 
ABC analysis must be performed between the truncated models, and, as will be shown 
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in Section 14.2^ the results can be used to estimate the model choice posterior for the 
original problem. 

1. Perform an ABC pilot analysis with ad-hoc summary statistics. Use the output 
for each model to choose training regions Ri of parameters which contain most 
of the posterior probability for each model Aii- 

2. Simulate a large number of {Ai, 9, x) triples using truncated models. 

3. Calculate S{x) by estimating sufficient statistics from simulations. 

4. Perform the ABC main analysis using S{x) and truncated models. 

5. Use truncation correction to estimate posterior probabilities. 

Algorithm 4: Semi-automatic ABC for model choice with truncation steps. Full 
details of the steps are given in Sections 14.11 and 14.21 

The remainder of this section discusses the implementation of the steps in these 
algorithms in more detail. Performance is assessed through simulation examples in 
Section [5l 

4.1 Calculating summary statistics 

This section describes a logistic regression based approach to estimating sufficient 
statistics from simulated training data. A motivating example is the case of two 
models Aii and A^2, with training data drawn from the joint distribution on {Ai, x), 
where ). Define q{x) = Pr(A^i|x). This is clearly a sufficient 

statistic for Ai. Logistic regression can be used to fit 

p 

logit q{x) := log{g(x)/[l - = /3o + /^iXi- (2) 

i=l 

The fitted q{x) is an estimate of a sufficient statistic. Note also that q{x)/[l — q{x)] 
is the Bayes factor multiplied by a constant depending on the prior model weights. 

To improve on the fit of ([2]) and cope with situations where x is very high dimen- 
sional or not a fixed-length vector, in practice we fit instead 

logitg(a;)=/3^/(x), (3) 

where f{x) is a vector of transformation of x, including a constant term. This can 
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perform initial dimension reduction and introduce non-linear functions of the data 
into the regression. Example choices of /(■) used later are 1) order statistics of raw 
data 2) a large number of summaries of genetic sequence data used in previous liter- 
ature, and transformation of these (a constant term is also included in both cases). 
To assist in the choice of /(■)) regression diagnostics can be used, for example to 
compare the quality of the logistic regression fits for some /i(-) and /2(-)- The supple- 
mentary material gives examples in which cross-validation estimates of the deviance 
are compared. 

In general the aim is to calculate S for choice between models A^i, Ai2, ■ ■ ■ ,-Mm, 
which for this discussion may represent original or truncated models. Fix a pair of 
distinct models, A^j and Aij, and consider the subset of training data made up of 
only the simulations from these models. Logistic regression can be used as above to 
estimate the probability qij of A4i\x under the {Ai,x) distribution for this training 
data subset. This is repeated for each pair of distinct models, and results in a vector 
of one-to-one transformations of Bayes factors. This target was shown to be sufficient 
for JH in Section |3l 

The logistic regression method set out above gives dim(S') = M(M — 1)/2, whereas 
Theorem [1] shows there are sufficient statistics of dimension M — 1. Alternative 
regression methods can be used to give dim(S') = M — 1, for example estimating an 
appropriate subset of the Bayes factors or multinomial regression. In this paper we 
consider only examples with M < 3 so the logistic regression approach has limited 
excess dimension. We believe it also aids robustness. Even if the logistic regression 
for one pair of models fits poorly (as is the case in the Campylobacter application), 
the others can still allow a good overall estimate of sufficient statistics. 



4.2 Other steps 

Pilot analysis The pilot ABC analysis uses an ad-hoc choice of summary statistics 
5'piiot- The purpose of the pilot analysis is to roughly identify regions containing 
most of the po sterior mass, so the procedure should be reasonably robust to the 
choice of SpHof iFearnhead and Prangld ( 120121 ) illustrate this argument by example. 



Validation tests could also be performed to test the quality of ABC output from 
analysing simulated data using Spiiot- 

In our implementation the pilot uses an ABC model choice algorithm such as 
Algorithm |2l An alternative approach would be to perform a separate pilot run for 
each model, focusing only on finding training regions, rather than initial model choice 
analysis. We did not investigate this as a pilot analysis incorporating model choice 
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has useful properties. The estimated posterior can serve as a verification that the 
final results appear sensible. Also, if the pilot results are sufficiently convincing in 
showing that certain models are incompatible with the data, they could be ruled out 
at this stage saving computational resources. 

Training region choice The training region Ri for model Ai[ should cover most 
of the posterior mass. Our implementation is to choose a hypercube, with the range 
of each parameter being the interval of sampled values. 

Simulating data We generate training data from the distribution on (A^,6',x) 
defined by the priors and models (or truncated models). An alternative model distri- 
bution can be used without affecting the arguments in Section 14.11 showing that the 
fitted summary statistics are estimates of sufficient statistics. This would be useful if 
some prior model weights are too small to fit all regressions well. 

Truncation correction Results of the main ABC analysis choosing between trun- 
cated models can be used to estimate those for the original model choice problem by 
the following consequence of ([T]): 

Tx{x\Mi) = rin{x\M'i), where = Pr(^ G Ri\Mi) /FiiO e R,\x,Mi). 

That is, the evidence of Aii equals that of multiplied by r^, the ratio of the prior 
and posterior probabilities of Ri. This allows estimation of Bayes factors or posterior 
probabilities for the original models given rj values. As Ri is chosen with the aim 
of containing most of the posterior mass, we estimate its posterior probability by 
1, giving an estimate fj = Pt{6 G Ri\Aii). This prior probability can usually be 
calculated directly when Rj is a hypercube. 



5 Examples 



To illustrate our semi-automatic ABC met hod, we apply it to three simple binar y 



model selection examples from the literature (jPidelot et al. 



2011 



Marin et al. 



,|20l3) 



and extend one of these to a 3 model example. The examples are summarised in Table 
[H The binary examples are the first two models in each letter group, and the 3 model 
example is the full A group. In each case the data are 100 independent draws from 
one of the models and the models have equal prior probabilities. All ABC analyses 
were performed using Algorithm [2j 
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Name 


Model 


Prior 


Al 


Poisson(6') 


6 ~ Exponential(l) 


A2 


Geometric(6') 


9 ~ Uniform(0, 1) 


A Q 

Ao 


Binomial (^iU, U) 


U ~ Betal^i, y j 


Bl 


Laplace (^,l/\/2) 


~ Normal(0, 2^) 


B2 


Normal(6', 1) 


^ ~ Normal(0, 2^) 


CI 


gk(0,l,0,fc) 


A; ~ Unif(-0.5, 5) 


C2 


gk(0,l,(/,fc) 


((7,A;)~Unif([0,4]x [-0.5,5]) 



Table 1: Models used in the examples of S ection [5l For details of the g-and-k distri- 
bution see iRayner and MacGillivrayi (l2002[ l. 



Binary model selection The semi-automatic ABC method of Algorithm S] was 
implemented starting with a pilot analysis using Sio{x) = {x^^\ x^^^\ . . . , x^^^^) where 
a;*-*-* is the zth order statistic. Model choice summary statistics were fitted as described 
in Section W7l\ using /(x) = (1 

(100) -J _ other summaries were added 
for parameter inference. The analysis used 2 x 10^ simulations, one quarter for the 
pilot and the rest used for both summary statistic fitting and the main analysis. 
The pilot and main analysis both accepted 100 simulations. Some alternative ABC 
analyses on the data were performed, each using the same total number of simulations 
and acceptances. Firstly, the analysis was repeated using Algorithm [3l Secondly, 
st andard ABC analys es were performed with Algorithm [2] using (a) S = Siq (b) S as 
in lMarin et al.l ( 2012 ): 4th and 6th moments for B, 10% and 90% quantiles for C. All 
ABC analyses used the following distance 



d{x,y) 



1/2 



(4) 



i.e. Euclidean distance between p-dimensional summary statistics normalised by esti- 
mated standard deviations, cTj. The latter were estimated from the simulated data. 

Figure [U shows estimated posterior probabilities for 5*10 and Algorithm |H Numer- 
ical summar i es of estimation quality are given in Table O This reports the entropic 
loss (IRobertiliggel ). 



100 



^log Pr(mo,i|a;obs,i)> 



i=l 



the negative log probability of the correct models mo,i, . . . , ""^0,100 estimated from the 
corresponding simulated datasets Xobs.i, • • • , a;obs,ioo- Also reported is the misallocation 
rate; the proportion of datasets where the highest weighted model was not the correct 
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model. Our method provides an improvement in all scenarios, although this is modest 
for example C. The use of the truncation steps from AlgorithmlHis shown to sometimes 
be crucial; when Algorithm |3l which omits these, is used instead, the results for 
example C are the worst of all methods. However the effect is problem dependent; 
in example B it made little difference. Exact posterior calculations are possible for 
examples A and B (the required L aplace marginal lik elihood calculations are described 
in Appendix 1 from version 1 of 
provides comparable results. 



Marin et al. 



20121 ). and in both cases Algorithm H] 



Beaumont 



(120081). For 



We attempted to apply post-processing by the method of 
example A this was usually not possible as there was no variation in the accepted 
summaries, which were discrete in this case, or because all acceptances were for a single 
model. For the other examples, it had little effect on entropic loss or misallocation 
rate, so these are not reported. 
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Figure 1: Boxplots of posterior probabilities of model 1 estimated by ABC (without 
post-processing) for 100 simulated datasets in each of three binary model comparison 
examples. The boxplots show quartiles of the values. Within each graph results are 
split by which model generated the data. The top row uses S = Sio, and the second 
row chooses S by semi-automatic ABC (Algorithm H]) . The columns represent three 
model choice examples detailed in Table [TJ 
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Summary statistics 


Binary A 


Example 
Binary B Binary C 


3 models 


'S'lo 

From literature 
From Algorithm [3] 
From Algorithm H] 


33.0 (17%) 

30.2 (14%) 
19.8 (15%) 


33.5 (11%) 
55.3 (25%) 
13.5 (5%) 
13.9 (7%) 


43.0 (16%) 
40.9 (20%) 
oo (21%) 
38.4 (14%) 


70.7 (39%) 

65.9 (42%) 
58.9 (33%) 


Posterior 


19.8 (12%) 


15.6 (8%) 




58.1 (36%) 



Table 2: Entropic loss and misallocation rate (in brackets) from several ABC analyses 
of 100 simulated datasets in each of four model comparison examples, detailed in Table 
[TJ The final row shows values under the exact posterior, where these are available, 
for comparison. 
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Figure 2: Plots of true posterior model weight against ABC estimates for 100 sim- 
ulated datasets in a three model example. Pluses are for S = Sio and crosses for 5* 
chosen by semi-automatic ABC (Algorithm H]) . 
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Selection between three models Algorithms E] and H] were implemented as for 
the two model examples, with the addition that three summary statistics were fitted, 
corresponding to three pairs of models. Figure [2] plots exact posterior probabilities 
against ABC estimates, and shows that Algorithm H] performs better than the com- 
parison analysis using S = Siq. This is confirmed by the quantitative summaries in 
Table [2l which also shows that Algorithm H] outperforms Algorithm |3] and achieves 
comparable results to the true posterior values. Post-processing results are not shown 
because, as mentioned above, they could usually not be calculated for this example. 



6 Application 



Campylobacter jejuni and C. eoli are bac terial pathogens t hat a re a major cause of 



human gastroenteritis around the world (iHumphrey et al. 



20071 ). They are consid- 



ered commensals of a wide variety of animals, including poultry, ruminants and wild 
birds, and human infection occurs as a result of inge sting contarainated food or drink- 
ing water and via direct contact with animal faeces (ISavill et al.l . l2003l ). New Zealand 
has very high rates of campy l obact eriosis and an investigation into the source of hu- 
man infection ( MuUner et al. . 20091 ) has generated a large dataset of isolates from hu- 
mans and anirnals t hat have been characterized by multilocus sequence typing, MLST 
( iDingle et al.l . l200ll ). The dataset of C. jejuni and C. coli isolates from New Zealand 



has been used to inform control policy ( ISears et al.l . 1201 ll ) and to es timate evo 



ary parameters, such as the rates of mutation and recombination ( lYu et al. 



ution- 



20121). 



We focus on the question of demographic history, which is of particular interest in New 
Zealand due to the relatively recent colonization by man a nd the unique pattern of 
animal introductions (both wildlife and domestic animals) ( Atkinson and Cameronl . 
19931 ). We ask: can we detect historic growth in the effective population size, and 
if so, does it correspond to a particular historical event? The relative isolation of 
this location means that neglecting ongoing exchange with outside populations is rea- 
sonably realistic. MLST data are available for over 3000 isolates from a variety of 
hosts. 

We present our methods and results below, with a discussion given in Section 17.21 
Some further details are provided as supplementary material. 
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6.1 Models and priors 



We modelled the C. jejuni data using a coalescent model using the Jukes-Cantor 
model of DNA substitut i on an d incorporating the gene conversion recombination 
Wiuf and HeinI ( 2000 ) with exponential demographic growth, as simula- 



model of 



tion of this scenario is straightforward using existing tools (detailed below). However, 
simulation of a large dataset is prohibitively slow so we used a random subsample of 
100 isolates. Coalescent theory suggests that su ch a sample size captures much of the 
information of the full sample f Nordborgi 2004), and simulation based checks on in- 
formativeness are detailed in the supplementary material. The selected isolates were 
confirmed to be C. jejuni using the PubMLST database and through a phylogeny 
analysis of these isolates and a representative C. coli sequence. Three models were 
considered, with equal prior weights: Model 1 no growth; Model 2 growth for 50 years 
(since expansion of the New Zealand poultry industry); Model 3 growth for 170 years 
(since introduction of European livestock, primarily from Australia and the UK). 

Each model has three biological parameters: a recombination rate, mean track 
length (i.e. length of recombining DNA segment) and mutation rate. Models 2 and 
3 also have a growth parameter. To aid interpretability we parameterised this as the 
relative increase in the effective population size during the period of growth. Prior 
information on parameters is summarised in Table [3l Mutation and recombination 
rates are given per kilobase per 2Np q years, where Np, is the effective population size 



and g the generation length in years. 



Wilson et al. 



( I2OO9I ) estimated the mean time to 



coalescence, NeQ, at 218 with an interval estimate of [155, 288]. To simplify our model, 
we fix NeQ = 218. We expect that variations of N^g within the quoted interval will 
not affect the detection of growth. Mean recombination length is in kilobase units. 
The relative growth parameter is unitless as it is a ratio of effective population sizes. 

Growth priors are based on demographics of the principal host; poultry for model 
2 and sheep/c attle for mode l 3. R ough estimates of host growth rates are used, based 
on the data of iBinney et al.l (l2013[ l. with variance increased to account for uncertainty 



of the link between bacterial and host demo graphics. Biological parameter priors are 

This assumed a no 



Wilson et al 



based on analysis of other C. jejuni data in 
growth model, so these priors may not be appropriate for models 2 and 3. Sensitivity 
analysis detailed in the supplementary material also considers a much less informative 
biological prior. 
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Parameter 


Units 


Model 


Point estimate 


95% CI 


Log normal 
Mean Sd 


Mutation rate 


kb-'-{2Neg)~^ 


All 


13.7 


[8.1,23.2] 


2.62 0.27 


Recombination rate 


kb-\2Neg)-^ 


All 


1.31 


[0.03,51.5] 


0.27 1.87 


Mean track length 


kb 


All 


4.52 


[0.1,209.9] 


1.51 1.96 


Relative growth 




2 


4.06 


[1.5,10.8] 


1.40 0.50 


Relative growth 




3 


33.1 


[2.9,383.8] 


3.50 1.25 



Table 3: Details of the parameter priors used in Section [61 Priors are assumed to be 
the product of a log normal prior for each individual parameter. The point estimates 
are geometric means. The recombination length prior was truncated below 1 base 
pair, and the recombination rate above 25kb~^{2Neg)~^ to avoid excessively slow 
simulations (All estimated posteriors for recombination rate were well below this - see 
Figure 2 of supplementary material.) 



6.2 Methods 



Data sets were simulated using ms (lHudsonl . l2002l ) and seq-gen (IRambaut and Grassly 



19971 ). Genetic summaries required were calculated using R (IR Core Teaml . 120121 ). 
which was also used to code the inference algorithms. 

We implemented semi-automatic ABC (Algorithm H]) as follows. First a pilot 
analysis was performed using the ABC SMC algorithm of iToni and Stumpj (120101 ) 
(detailed in the supplementary material) with 1000 particles. This targeted log- 
transformed parameters, as on the original scale the target distribution is roughly 
log-normal and hard for the algorithm to explore. The summary statistics were a 
set of 15 genetic summaries (these, and other summaries used below, are listed in the 
supplementary material). The distance function was Equation (jl]), Euclidean distance 
between normalised summary statistics, with standard deviations estimated from 100 
datasets simulated from the prior predictive distribution. These simulations were 
also used to choose an initial ABC threshold: the median of the distances between 
these datasets and the observations. In following SMC iterations, the threshold was 
the median of distances for accepted particles in the preceding step. The algorithm 
terminated after the iteration which reached 2 x 10^ simulated data sets. 

To fit summary statistics, 2 x 10^ datasets were simulated using the training re- 
gions. Model choice summaries were fitted as described in Section UT] and summaries 
for parameter inference by linear regression (detailed shortly). For all regressions the 
vector of covariates /(■) consisted of 3 cubic B-sphne bases for each of 125 genetic 
summaries, giving a total of 375 covariates, and a constant term. Spline transfor- 
mations were included to capture non-linear effects. Due to the large number of 
covariates, Li penalised versions of log i stic a nd hnear regression were used, using 
the 'glmnet' R package (IFriedman et al.l . |2010[ ) with the tuning parameter chosen by 
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cross-validation. Cross-validation estimates of fitting error were used to investigate 
which genetic summaries were most informative and to validate many of our modelling 
and tuning choices (details in supplementary material). 

Exploratory analysis showed that for each parameter a single estimator could per- 
form reasonably well under all models (details in supplementary material). To keep 
dim(5') small, our S is the concatenation of such estimators with model choice statis- 
tics. A single hypercube training region was used for all models to prevent behaviour 
of a particular model being overrepresented in any region of parameter space. This 
training region was the product of the parameter ranges from the entire pilot output, 
regardless of model. The regression responses were log-transformed parameters, sup- 
ported by exploratory analysis of Box-Cox transformations. The resulting predictors 
were exponentiated to use in S. Regressions for biological parameters were fitted us- 
ing the simulations from all models, while those for the demographic parameter used 
simulations from the growth models only. 

The final S vector used in the main ABC analysis consisted of four parameter esti- 
mators and three statistics for model choice. The analysis used the distance function 
(jlj) with summary statistic standard deviations estimated from the training data. The 
analysis used the same SMC ABC algorithm as the pilot run, again with 1000 parti- 
cles and targeting log-transformed parameters. The initial threshold was the median 
of distances to the observed data calculated from the training data, with subsequent 
thresholds chosen as in the pilot run. The algorithm terminated after the iteration 
which reached 4 x 10^ simulated data sets. 



6.3 Results 

Table m summarises the model choice results for the pilo t and m ain analyses, including 



the effect of regression post-processing as in iBeaumontI ( l2008l ). They agree in putting 



the majori ty of the weight on model 1, the no growth model. Effective sample sizes 



( iLid . 119961 ) show that Monte Carlo error is approximately equal to that of a moderately 
large independent sample. The supplementary material details sensitivity analyses 
which vary the parameter priors and the subsample of isolates used as observations. 
With the exception of some pilot analyses, the weight placed on model 1 remains 
in the range 80 — 100%. ABC analyses of simulated datasets are also described in 
the supplementary material. Although only a small number were possible due to 
the high computational cost, the results suggest that the analyses are capable of 
distinguishing the no-growth from the growth models, with the main analysis doing 
so more accurately. 
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Analysis 




Post-processed? 


Model i 


Model I 


Model 3 


Pilot 


348 


No 


0.86 


0.11 


0.04 


Yes 


1.00 


0.00 


0.00 


Main 


600 


No 
Yes 


0.96 
0.92 


0.03 
0.03 


0.01 
0.05 



Table 4: Estimated posterior probabilities and effective sample sizes from ABC anal- 
yses on Campylobacter data. 



Table [5] summarises the parameter inference results. Marginal density plots are 



provided in the suppleme ntary material. 
regression adjustment of 



Beaumont et al. 



he tab le includes results from applying the 
(I2OO2I ) to model 1 output. This was not 



applied to other models as there were too few accepted particles to expect it to be 
stable. The most notable finding is the low estimate of recombination rate, discussed 
further in Section 17.21 Additionally, informative estimates are made for mutation rate 
and relative growth. The latter concentrates on low values, providing further evidence 
against significant growth. Sensitivity analyses detailed in the supplementary material 
support these qualitative findings, although the numerical values are less robust than 
those for model choice. 





Recombination rate Mean track length Mutation rate Relative growth 
kb-'^{2Neg)-^ kb kb-^{2Neg)-^ 


Model 1 
Prior Model 2 
Model 3 


1.31 [0.03, 51.5] 4.52 [0.1, 209.9] 13.7 [8.1, 23.2] 

1.31 [0.03, 51.5] 4.52 [0.1, 209.9] 13.7 [8.1, 23.2] 4.06 [1.5, 10.8] 

1.31 [0.03, 51.5] 4.52 [0.1, 209.9] 13.7 [8.1, 23.2] 33.1 [2.9, 383.8] 


Model 1 
p.. Model 1 (adjusted) 
Model 2 
Model 3 


0.34 [0.02, 5.21] 2.43 [0.06, 88.2] 11.4 [7.63, 16.7] 

0.18 [0.02, 1.87] 1.04 [0.05, 24.8] 12.8 [10.2, 16.7] 

0.28 [0.02, 2.45] 1.99 [0.09, 24.1] 12.6 [8.76, 17.2] 2.12 [1.07, 3.07] 

0.17 [0.01, 0.78] 1.58 [0.08, 94.9] 12.2 [8.80, 15.1] 4.81 [0.97, 19.0] 


Model 1 
Model 1 (adjusted) 
Model 2 
Model 3 


0.55 [0.02, 3.74] 5.81 [0.17, 239.2] 12.9 [10.1, 16.5] 

0.22 [0.02, 1.18] 2.98 [0.22, 63.2] 13.0 [10.6, 15.9] 

0.24 [0.01, 3.53] 5.73 [0.52, 239] 14.0 [11.6, 16.5] 1.51 [0.85, 2.71] 

0.34 [0.01, 3.37] 3.08 [0.40, 128] 12.6 [9.81, 16.4] 1.12 [0.41, 2.44] 



Table 5: Parameter point estimates (geometric means) and 95% credible intervals 
from prior and ABC analyses on Campylobacter data. 



The regression and ABC results were also used to find which genetic summaries 
were particularly informative, and to show that some aspects of the data fitted poorly 
under any model. These results are given in the supplementary material, and can 
inform future modelling and analyses. 
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7 Discussion 



7.1 ABC Methodology 

It is often desirable to perform model choice and parameter inference using the same 
simulations. Our methodology focuses on producing S appropriate for model choice 
only. Section [6] contains an application-specific example of adding a small number 
of further summaries to S which are informative for parameter inference. General 
purpose methods to choose such low dimensional summaries would be useful. How- 
ever, often each model may require separate summaries, so that a choice of S suitable 
for model choice and parameter inference would be high dimensional. An alternative 
strategy is to develop ABC methods in which comparisons of simulated and observed 
data do not always use the same summaries. A simple approach would be to perform 
separate rejection sampling analyses for model choice and for parameter inference un- 
der each model. A possible alternative is an MCMC algorithm which moves between 
models using only summaries relevant to the model(s) involved in the current step. 

There are numerous alternatives to logistic regressio n to fit sumrnary st atistics 
for model choice, such as linear discriminant analysis ( lEstoup et al.l . 120121 ) and a 
comparison of their performance within ABC may be interesting. Other parts of 
our semi-automatic method could also be varied. For example, our choice of S is 
a vector of one-to-one transformations of Bayes factors, and other transformations 
may perform differently. Also, other methods could produce a more accurate training 
region, such as fitting a flexible model to the pilot output. 

For simplicity we have used relatively simple ABC algorithms. However, much 
prog ress is being made in im proving algorithmic efficiency, especially of ABC SMC 



Del Moral et al 



20121 ). Our work is complementary to this and it could be 



(e.g. 

used with many such improved algorithms. Indeed ABC SMC algorithms can also be 
modified to incorporate semi-automatic ABC. For example, recall that in Section [5] 
the training dat a were reused as the simulations needed for ABC rejection sampling. 
As suggested by Barnes et al.l (2012), in ABC SMC they could be similarly reused for 
the first SMC iteration. 



7.2 Campylobacter application 

Our main finding is support for a model with no change in the effective population 
size of C. jejuni. This is surprising over a period where its ecological niche has greatly 
increased. Analysis in the supplementary material shows some features of the data are 
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poorly fitted under all models, suggesting that more detailed demographic structure is 
necessary to fit the data well. One potential modification is subpopulation structure 
amongst the hosts which might reveal that only some support growing C. jejuni 
populations. 

Our analysis also produced parameter estimates. Those for mutation rate and 
mean length of recombination tract are comparable to those from oth er work. The 



point estimates of recombination rate are somewhat smaller than those of lWilson et al. 



(120091 ). who performed a similar ABC analysis on a different dataset. Furthermore our 



credib le intervals a r e mu ch n arrower, and exc lude the estimates of 
feoosf ). biggs et all J2OI1I ) and 



Fearnhead et al. 



Yu et al. 



( I2OI2I ) , who find recombi nation and mutation 



Wilson et al. 



( 120091 ) 



rates to be of the same order of magnitude. The discrepancy with 
is conceivably due to their use of a heavy tailed prior or ABC tuning differences such 
as choice of threshold. The others suggest differences in the model or data used. 



Yu et al. 



( I2OI2I ). their analysis, and that of iBiggs et al. 



For ex ample, as discussed by 

({2011 ). is for closely related sequences, and may reveal a high level of recombination 



that is then removed by purifying selection. 
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Appendix: Proof of Theorem [T] 

Bayes sufficiency of S{x) for Ai is equivalent to the following being true for all i and 
Pm, and almost any x, 

Vi{Mi\S{x)) = Fi{Mi\x). (5) 

For convenience we introduce p = {pM{-Mi))i<i<M to represent the prior mass func- 
tion. Also, let 1 be a vector of M components equal to 1. 

First assume S is Bayes sufficient for A^. Define hi{S{x),p) = PTp{M.i\S{x)) 
(i.e. the conditional probability under prior p) and note hi{S{x),p) = Prp(A^j|a;). 
The required function is g{S{x)) = {hi{S{x), M~^l))i<i<j^,^ -^. 
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It remains to prove Bayes sufficiency for M. given a function g of the form de- 
scribed in the theorem. Henceforth we consider only the case p = M~^l, since 
in this case ([5]) is equivalent to Pr(a;|A^j) = kPT{S{x)\Mi) for some constant k, 
and applying Bayes' theorem to this proves ([5]) for general p. It also suffices to 
show that ([5]) holds for all i < M; the case i = M follows as probabilities sum 
to 1. Fix some i < M and define an indicator variable Y = = Mi]- Then 

Ti{x) = Fi{Mi\x) = E[Y\x] and FT{Mi\S{x)) = E[Y\S{x)]. To prove we will 
show that these conditional expectations are almost always equal. Standard proper- 
ties of conditional expectation give -^[^[^(a;)] = £^[i?{F|x}|5'(x)] = E[Ti{x)\S{x)]. 
Finally, E[Ti{x)\S{x)] = E[gi{S{x))\S{x)] = gi{S{x)) = Ti{x) = E[Y\x] for almost all 
X as required, where gi{-) represents the ith component of the g{-) function. 
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